FIFO is Better than LRU: the Power of Lazy Promotion and Quick Demotion (jasony.me)
June 24, 2023 by Juncheng Yang

TL;DR: Historically, FIFO-based algorithms have been thought to be less efficient (having higher miss ratios) than LRU-based algorithms. In this blog, we introduce two techniques: lazy promotion, which promotes objects only at eviction time, and quick demotion, which quickly removes most new objects. We will show that:
- Conventional-wisdom-suggested “weak LRUs”, e.g., FIFO-Reinsertion, are actually more efficient (having lower miss ratios) than LRU;
- Simply evicting most new objects can improve a state-of-the-art algorithm’s efficiency;
- Eviction algorithms can be designed like building LEGOs by adding lazy promotion and quick demotion on top of FIFO.
Background
Caching is a well-known and widely deployed technique to speed up data access, reduce repeated computation and data transfer. A core component of a cache is the eviction algorithm, which chooses the objects stored in the limited cache space. Two metrics describe the performance of an eviction algorithm: efficiency measured by the miss ratio and throughput measured by the number of requests served per second.
The study of cache eviction algorithms has a long history, with a majority of the work centered around LRU (that is, to evict the least-recently-used object). LRU maintains a doubly-linked list, promoting objects to the head of the list upon cache hits and evicting the object at the tail of the list when needed. Belady and others found that memory access patterns often exhibit temporal locality — “the most recently used pages were most likely to be reused in the immediate future”. Thus, LRU using recency to promote objects was found to be better than FIFO.
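To make the eager-promotion behavior concrete, here is a minimal Python sketch of classic LRU (my illustration, not code from the post): every hit moves the object to the head of the queue, and eviction takes the tail.

```python
from collections import OrderedDict

class LRUCache:
    """Classic LRU: promote on every hit, evict the least-recently-used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # most recently used at the end

    def get(self, key):
        if key not in self.store:
            return None
        # eager promotion: every cache hit reorders the queue
        self.store.move_to_end(key)
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        elif len(self.store) >= self.capacity:
            self.store.popitem(last=False)  # evict from the tail
        self.store[key] = value
```

A real linked-list implementation performs this `move_to_end` by updating (at least) six pointers under a lock, which is the overhead the post contrasts with FIFO.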
Most eviction algorithms designed to achieve high efficiency start from LRU. For example, many algorithms, such as ARC, SLRU, 2Q, MQ, and multi-generational LRU, use multiple LRU queues to separate hot and cold objects. Some algorithms, e.g., LIRS and LIRS2, maintain an LRU queue but use different metrics to promote objects. While other algorithms, e.g., LRFU, EE-LRU, LeCaR, and CACHEUS, augment LRU’s recency with different metrics. In addition, many recent works, e.g., Talus, improve LRU’s ability to handle scan and loop requests.
Besides efficiency, there have been fruitful studies on enhancing the cache’s throughput performance and thread scalability. Each cache hit in LRU promotes an object to the head of the queue, which requires updating at least six pointers guarded by locks. These overheads are not acceptable in many deployments that need high performance. Thus, performance-centric systems often use FIFO-based algorithms to avoid LRU’s overheads. For example, FIFO-Reinsertion and variants of CLOCK have been developed, which serve as LRU approximations. It is often perceived that these algorithms trade miss ratio for better throughput and scalability.
In this blog, I am going to show that FIFO is in fact better than LRU: not only is it faster and more scalable, it is also more efficient and effective (having lower miss ratios).
Why FIFO and What it Needs
FIFO has many benefits over LRU. For example, FIFO has less metadata and requires no metadata update on each cache hit, and thus is faster and more scalable than LRU. In contrast, LRU requires updating six pointers on each cache hit, which is not friendly for modern computer architecture due to random memory accesses. Moreover, FIFO is always the first choice when implementing a flash cache because it does not incur write amplification. Although FIFO has throughput and scalability benefits, it is common wisdom that FIFO is less effective (having higher miss ratios) than LRU.
(A cache can be viewed as a logically ordered queue with four operations: insertion, removal, promotion and demotion. Most eviction algorithms are promotion algorithms.)
To understand the various factors that affect the miss ratio, we introduce a cache abstraction. A cache can be viewed as a logically total-ordered queue with four operations: insertion, removal, promotion, and demotion. Objects in the cache can be compared and ordered based on some metric (e.g., time since the last request), and the eviction algorithm evicts the least valuable object based on the metric. Insertion and removal are user-controlled operations, where removal can either be directly invoked by the user or indirectly via the use of time-to-live (TTL). Promotion and demotion are internal operations of the cache used to maintain the logical ordering between objects.
We observe that most eviction algorithms use promotion to update the ordering between objects. For example, all the LRU-based algorithms promote objects to the head of the queue on cache hits, which we call eager promotion. Meanwhile, demotion is performed implicitly: when an object is promoted, other objects are passively demoted. We call this process passive demotion, a slow process as objects need to traverse through the cache queue before being evicted. However, we will show that instead of eager promotion and passive demotion, eviction algorithms should use lazy promotion and quick demotion.
Lazy Promotion
To prevent popular objects from being evicted while not incurring much performance overhead, we propose adding lazy promotion on top of FIFO (called LP-FIFO), which promotes objects only when they are about to be evicted. Lazy promotion aims to retain popular objects with minimal effort. An example is FIFO-Reinsertion (note that FIFO-Reinsertion, 1-bit CLOCK, and Second Chance are different implementations of the same eviction algorithm): an object is reinserted at eviction time if it has been requested while in the cache.
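A minimal Python sketch of FIFO-Reinsertion (again my illustration, not the post’s code): a hit only sets a bit, and promotion happens lazily at eviction time by reinserting marked objects at the head of the queue.

```python
from collections import OrderedDict

class FIFOReinsertion:
    """FIFO with lazy promotion: hits set a bit; reinsertion happens
    only at eviction time (equivalent to 1-bit CLOCK / Second Chance)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # key -> (value, visited); FIFO order

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        # lazy promotion: mark the object, but do not reorder the queue
        self.store[key] = (entry[0], True)
        return entry[0]

    def put(self, key, value):
        # assumes put() is called only on a miss
        while len(self.store) >= self.capacity:
            old_key, (old_val, visited) = self.store.popitem(last=False)
            if visited:
                # second chance: reinsert at the head with the bit cleared
                self.store[old_key] = (old_val, False)
            else:
                break  # evicted an unvisited object, freeing a slot
        self.store[key] = (value, False)
```

Note how a `get` touches a single Boolean field and never needs the queue lock, which is the source of LP-FIFO’s throughput and scalability advantage.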
LP-FIFO has several benefits over eager promotion (promoting on every access) used in LRU-based algorithms.
- First, LP-FIFO inherits FIFO’s throughput and scalability benefits because few metadata operations are needed when an object is requested. For example, FIFO-Reinsertion only needs to update a Boolean field upon the first request to a cached object without locking.
- Second, performing promotion at eviction time allows the cache to make better decisions by accumulating more information about the objects, e.g., how many times an object has been requested.
| Trace | approx time | #trace | cache type | #req (millions) | #obj (millions) |
|---|---|---|---|---|---|
| MSR | 2007 | 13 | block | 410 | 74 |
| FIU | 2008 | 9 | block | 514 | 20 |
| Cloudphysics | 2015 | 106 | block | 2,114 | 492 |
| Major CDN | 2018 | 219 | object | 3,728 | 298 |
| Tencent Photo | 2018 | 2 | object | 5,650 | 1,038 |
| Wiki CDN | 2019 | 3 | object | 2,863 | 56 |
| Tencent CBS | 2020 | 4030 | block | 33,690 | 551 |
| Alibaba | 2020 | 652 | block | 19,676 | 1,702 |
| | 2020 | 54 | KV | 195,441 | 10,650 |
| Social Network | 2020 | 219 | KV | 549,784 | 42,898 |
To understand LP-FIFO’s efficiency, we performed a large-scale simulation study on 5307 production traces from 10 data sources, which include open-source and proprietary datasets collected between 2007 and 2020. The 10 datasets contain 814 billion (6,386 TB) requests and 55.2 billion (533 TB) objects, and cover different types of caches, including block, key-value (KV), and object caches. We further divide the traces into block and web (including Memcached and CDN). We choose small/large cache size as 0.1%/10% of the number of unique objects in the trace.
We compare the miss ratios of LRU with two LP-FIFO algorithms: FIFO-Reinsertion and 2-bit CLOCK. 2-bit CLOCK tracks object frequency up to three, and an object’s frequency decreases by one each time the CLOCK hand scans through it. Objects with frequency zero are evicted.
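The 2-bit CLOCK variant can be sketched in a few lines of Python (a simplified illustration under the description above, not the evaluated implementation): hits increment a per-object counter capped at three, and the eviction scan decrements counters until it finds an object at zero.

```python
from collections import OrderedDict

class Clock2Bit:
    """2-bit CLOCK: hits bump a counter (max 3); the eviction hand
    decrements counters and evicts the first object at zero."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # key -> [value, freq]; scan order

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        entry[1] = min(entry[1] + 1, 3)  # lazy promotion: counter only
        return entry[0]

    def put(self, key, value):
        # assumes put() is called only on a miss
        while len(self.store) >= self.capacity:
            old_key, entry = self.store.popitem(last=False)
            if entry[1] == 0:
                break               # frequency zero: evict
            entry[1] -= 1           # the hand decrements and moves on
            self.store[old_key] = entry
        self.store[key] = [value, 0]
```

The extra bit lets the cache distinguish objects hit once from objects hit repeatedly, which is exactly what helps on the social network workloads discussed below.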
Common wisdom suggests that these two LP-FIFO examples are LRU approximations and will exhibit higher miss ratios than LRU. However, we found that LP-FIFO often exhibits miss ratios lower than LRU.
Comparison of FIFO-Reinsertion, 2-bit CLOCK and LRU on 10 datasets with 5307 traces. Left two: small cache, right two: large cache. A longer bar means the algorithm is more efficient (having lower miss ratios on more traces). Note that we do not consider the overhead of LRU metadata in all the evaluations.
The figure above shows that FIFO-Reinsertion and 2-bit CLOCK are better than LRU on most traces. Specifically, FIFO-Reinsertion is better than LRU on 9 and 7 of the 10 datasets using a small and large cache size, respectively. Moreover, on half of the datasets, more than 80% of the traces in each dataset favor FIFO-Reinsertion over LRU at both sizes. On the two social network datasets, LRU is better than FIFO-Reinsertion (especially at the large cache size). This is because most objects in these two datasets are accessed more than once, and using one bit to track object access is insufficient. Therefore, when increasing the one bit in FIFO-Reinsertion (CLOCK) to two bits (2-bit CLOCK), we observe that the number of traces favoring LP-FIFO increases to around 70%. Across all datasets, 2-bit CLOCK is better than FIFO on all datasets at the small cache size and 9 of the 10 datasets at the large cache size.
FIFO-Reinsertion demotes new objects faster than LRU because objects requested before the new object also push it down the queue.
Two reasons contribute to LP-FIFO’s high efficiency. First, lazy promotion often leads to quick demotion. For example, under LRU, a newly-inserted object G is pushed down the queue only by (1) new objects and (2) cached objects that are requested after G. However, besides the objects requested after G, the objects requested before G (but have not been promoted, e.g., B, D) also push G down the queue when using FIFO-Reinsertion. Second, compared to promotion at each request, object ordering in LP-FIFO is closer to the insertion order, which we conjecture is better suited for many workloads that exhibit popularity decay — old objects have a lower probability of getting a request.
While LP-FIFO surprisingly wins over LRU in miss ratio, it cannot outperform state-of-the-art algorithms. We next discuss another building block that bridges this gap.
Quick Demotion
Efficient eviction algorithms not only need to keep popular objects in the cache but also need to evict unpopular objects fast. In this section, we show that quick demotion (QD) is critical for an efficient eviction algorithm, and it enables FIFO-based algorithms to achieve state-of-the-art efficiency.
Because demotion happens passively in most eviction algorithms, an object typically traverses through the cache before being evicted. Such traversal gives each object a good chance to prove its value to be kept in the cache. However, cache workloads often follow Zipf popularity distribution, with most objects being unpopular. This is further exacerbated by (1) the scan and loop access patterns in the block cache workloads, and (2) the vast existence of dynamic and short-lived data, the use of versioning in object names, and the use of short TTLs in the web cache workloads. We believe the opportunity cost of new objects demonstrating their values is often too high: the object being evicted at the tail of the queue may be more valuable than the objects recently inserted.
An example of quick demotion: adding a small FIFO to filter most new objects that do not have a request soon after insertion.
To illustrate the importance of quick demotion, we add a simple QD technique on top of state-of-the-art eviction algorithms. The QD technique consists of a small probationary FIFO queue storing cached data and a ghost FIFO queue storing metadata of objects evicted from the probationary FIFO queue. The probationary FIFO queue uses 10% of the cache space and acts as a filter for unpopular objects: objects not requested after insertion are evicted early from the FIFO queue. The main cache runs a state-of-the-art algorithm and uses 90% of the space. The ghost FIFO stores as many entries as the main cache. Upon a cache miss, the object is written into the probationary FIFO queue unless it is in the ghost FIFO queue, in which case, it is written into the main cache. When the probationary FIFO queue is full, if the object to evict has been accessed since insertion, it is inserted into the main cache. Otherwise, it is evicted and recorded in the ghost FIFO queue.
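The admission flow above can be sketched as follows. This is a simplified Python illustration of the QD filter: the main cache is shown as a plain FIFO for brevity, whereas the post layers the filter on ARC, LIRS, and other algorithms; sizing and names are mine.

```python
from collections import OrderedDict

class QDCache:
    """Quick demotion: a small probationary FIFO filters one-hit wonders;
    a ghost FIFO (metadata only) lets re-requested objects skip the filter.
    The main cache is a plain FIFO here purely for illustration."""

    def __init__(self, capacity):
        self.fifo_cap = max(1, capacity // 10)    # ~10% probationary
        self.main_cap = capacity - self.fifo_cap  # ~90% main cache
        self.fifo = OrderedDict()   # key -> [value, accessed_since_insert]
        self.main = OrderedDict()   # key -> value
        self.ghost = OrderedDict()  # keys evicted from probation (no data)

    def get(self, key):
        if key in self.main:
            return self.main[key]
        if key in self.fifo:
            self.fifo[key][1] = True
            return self.fifo[key][0]
        return None

    def _insert_main(self, key, value):
        if len(self.main) >= self.main_cap:
            self.main.popitem(last=False)  # main eviction (FIFO here)
        self.main[key] = value

    def put(self, key, value):  # called on a miss
        if key in self.ghost:
            # seen recently: bypass the probationary filter
            del self.ghost[key]
            self._insert_main(key, value)
            return
        if len(self.fifo) >= self.fifo_cap:
            old_key, (old_val, accessed) = self.fifo.popitem(last=False)
            if accessed:
                self._insert_main(old_key, old_val)  # promoted to main
            else:
                self.ghost[old_key] = None  # quick demotion, remember key
                if len(self.ghost) > self.main_cap:
                    self.ghost.popitem(last=False)
        self.fifo[key] = [value, False]
```

Objects never requested again spend only a short stay in the small probationary queue before being demoted, which is the whole point of QD.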
We add this FIFO-based QD technique to five state-of-the-art eviction algorithms, ARC, LIRS, CACHEUS, LeCaR, and LHD. We used the open-source LHD implementation from the authors, implemented the others following the corresponding papers, and cross-checked with open-source implementations. We evaluated the QD-enhanced and original algorithms on the 5307 traces. Because the traces have a wide range of miss ratios, we choose to present each algorithm’s miss ratio reduction from FIFO, calculated as (mr_FIFO - mr_algo) / mr_FIFO.
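For clarity, the metric is the relative improvement over FIFO, which can be computed as:

```python
def miss_ratio_reduction(mr_fifo, mr_algo):
    """Relative miss ratio reduction from FIFO, as plotted below.
    Positive means the algorithm beats FIFO; 0 means it ties FIFO."""
    return (mr_fifo - mr_algo) / mr_fifo
```

For example, an algorithm with a 0.40 miss ratio on a trace where FIFO misses 50% of requests achieves a 20% reduction.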
On the block traces, quick demotion can improve most state-of-the-art algorithms’ efficiency. Left: small cache, right: large cache.
On the web traces, quick demotion can improve all state-of-the-art algorithms’ efficiency. Left: small cache, right: large cache.
The figures above show that the QD-enhanced algorithms further reduce the miss ratio of each state-of-the-art algorithm on almost all percentiles. For example, QD-ARC (QD-enhanced ARC) reduces ARC’s miss ratio by up to 59.8% with a mean reduction of 1.5% across all workloads on the two cache sizes, QD-LIRS reduces LIRS’s miss ratio by up to 49.6% with a mean of 2.2%, and QD-LeCaR reduces LeCaR’s miss ratio by up to 58.8% with a mean of 4.5%. Note that achieving a large miss ratio reduction on a large number of diverse traces is non-trivial. For example, the best state-of-the-art algorithm, ARC, only reduces LRU’s miss ratio by 6.2% on average.
The gap between the QD-enhanced algorithm and the original algorithm is wider (1) when the state-of-the-art is relatively weak, (2) when the cache size is large, and (3) on the web workloads. With a weaker state-of-the-art, the opportunity for improvement is larger, allowing QD to provide more prominent benefits. For example, QD-LeCaR reduces LeCaR’s miss ratios by 4.5% on average, larger than the reductions on other state-of-the-art algorithms. When the cache size is large, unpopular objects spend more time in the cache, and quick demotion becomes more valuable. For example, QD-ARC and ARC have similar miss ratios on the block workloads at the small cache size. But QD-ARC reduces ARC’s miss ratio by 2.3% on average at the large cache size. However, when the cache size is too large, e.g., 80% of the number of objects in the trace, adding QD may increase the miss ratio (not shown). At last, QD provides more benefits on the web workloads than the block workloads, especially when the cache size is small. We conjecture that web workloads have more short-lived data and exhibit stronger popularity decay, which leads to a more urgent need for quick demotion. While quick demotion improves the efficiency of most state-of-the-art algorithms, for a small subset of traces, QD may increase the miss ratio when the cache size is small because the probationary FIFO is too small to capture some potentially popular objects.
Although adding the probationary FIFO improves efficiency, it further increases the complexity of the already complicated state-of-the-art algorithms. To reduce complexity, we add the same QD technique on top of 2-bit CLOCK and call it QD-LP-FIFO. QD-LP-FIFO uses two FIFO queues to cache data and a ghost FIFO queue to track evicted objects. It is not hard to see QD-LP-FIFO is simpler than all state-of-the-art algorithms — it requires at most one metadata update on a cache hit and no locking for any cache operation. Therefore, we believe it will be faster and more scalable than all state-of-the-art algorithms. Besides enjoying all the benefits of simplicity, QD-LP-FIFO also achieves lower miss ratios than state-of-the-art algorithms. For example, compared to LIRS and LeCaR, QD-LP-FIFO reduces miss ratio by 1.6% and 4.3% on average, respectively, across the 5307 traces. While the goal of this work is not to propose a new eviction algorithm, QD-LP-FIFO illustrates how we can build simple yet efficient eviction algorithms by adding quick demotion and lazy promotion techniques to a simple base eviction algorithm such as FIFO.
Discussion
We have demonstrated reinsertion as an example of LP and the use of a small probationary FIFO queue as an example of QD. However, these are not the only techniques. For example, reinsertion can leverage different metrics to decide whether the object should be reinserted. Besides reinsertion, several other techniques are often used to reduce promotion and improve scalability, e.g., periodic promotion, batched promotion, promoting old objects only, and promoting with try-lock. Although these techniques do not fall into our strict definition of lazy promotion (promotion on eviction), many of them effectively retain popular objects from being evicted. On the quick demotion side, besides the small probationary FIFO queue, one can leverage other techniques to define and discover unpopular objects such as Hyperbolic and LHD. Moreover, admission algorithms, e.g., TinyLFU, Bloom Filter, probabilistic, and ML-based admission algorithms, can be viewed as a form of QD — albeit some of them are too aggressive at demotion (rejecting objects from entering the cache).
Note that QD bears similarity with some generational garbage collection algorithms, which separately store short-lived and long-lived data in young-gen and old-gen heaps. Therefore, ideas from garbage collection may be borrowed to strengthen cache eviction algorithms.
The design of QD-LP-FIFO opens the door to designing simple yet efficient cache eviction algorithms by innovating on LP and QD techniques. And we envision future eviction algorithms can be designed like building LEGO — adding lazy promotion and quick demotion on top of a base eviction algorithm.
Acknowledgement
There are many people I would like to thank, including but not limited to my co-authors, Carnegie Mellon University Parallel Data Lab (and our sponsors), and Cloudlab. I would also like to give a big shoutout to the people and organizations that open-sourced the traces, without which this work would not be possible and we would NEVER know that CLOCK is better than LRU on every aspect!
This work is published at HotOS’23, more details can be found here, and the slides can be found here.